Introduction

This practical session illustrates elements of Microbial Metagenomics data analysis, and a Microbial Biomarker development. Specifically, it includes:

  • Exploration, manipulation and visualization of microbial metagenomic data using Phyloseq package,
  • Statistical methods for Alpha and Beta diversity assessment (including PERMANOVA with vegan::adonis2),
  • Differentially abundant taxa detection with ANCOMBC2,
  • Building a Random Forest model to predict the former land use by the microbial community.

It is expected that you already have some general experience with R and R markdown (including ggplot and dplyr), and some experience with machine learning in R (including caret package and random forest analysis).

The dataset is provided by the RestREco project, which is currently running in Cranfield under the lead of Prof. Jim Harris: https://restreco.com/

The RestREco project studies ecosystems after prior industrial or agricultural use. The main idea behind the project is that we will never be able to restore ecosystems to their prior state (e.g. because of the climate change, population increase or the technology developments). So, we need to develop some new criteria to manage ecosystem restoration: for instance restore to some sufficiently complex and resilient state (so called “restoration to complexity”).

Soil microbiome is just one of the many aspects considered by the project. Soil samples were collected from tens of sites in Scotland and in England, which had been previously exposed to intensive Industrial (quarrying) or Agricultural use. The dataset selected for this practical includes only woodland; additionally including a small number of ancient woodland sites. The soil microbial community was assessed using 16S-amplicon sequencing. The code provided in to you below is focusing on the effect of the Prior Land Use: Industrial vs Agricultural sites. You are expected to follow this example, but modify the code to compare the Regions: Scotland vs England.

The practical session aims to give a very broad overview of popular microbial metagenomics tools in R. So, it may be challenging to complete this practical if you are new to the microbial analysis and/or if your prior R and ML experience is limited. So don’t be discouraged if you don’t complete everything within the given time! Work with your own pace: you may keep this script and the data for a reference to use later if/when you need.

Good Luck & Happy Coding!

Dataset summary

Prepare environment

Installing required packages

These packages should be installed once, hence the installation commands are commented.

It would be a good idea to install these packages in advance, so you may focus on the actual data analysis during the practical session.

# CRAN

# install.packages("vegan")
# install.packages("Polychrome")
# install.packages(dendextend) 
# install.packages("ggplotify")
# install.packages("parallelMap")
# install.packages("caret")
# install.packages("randomForest")

# Bioconductor

# install.packages("BiocManager")
# BiocManager::install("phyloseq")
# BiocManager::install("ANCOMBC")
# BiocManager::install("mixOmics")

# GitHub

# install.packages("devtools")
# devtools::install_github("jbisanz/qiime2R")

Time stamp & initial clean-up

date()
## [1] "Tue Mar 19 20:12:10 2024"
rm(list=ls())
graphics.off()

Loading selected packages

These are the generic packages used in may sections of the code. The remaining packages will be loaded when needed.

library(dplyr)
library(ggplot2)

Import of QIIME2 data to Phyloseq

The source data for this analysis was prepared in QIIME2 (see more details about QIIME2 and the data preparation in the lecture). The QIIME2 data format is called “QIIME2 artifacts”, and has the QZA file extension: for QIIME2 Zip Archive. As you may see from the file extension, it’s just a Ziped archive containing some data (such as counts matrix in BIOM format, or sequences in FASTQ format, or phylogenetic tree, or taxa names, etc) together with the technical information about how the data was generated (so called “provenance information”). You may unzip QIIME2 artifacts to see their content (optional!).

One of the most popular data formats for ecology metagenomics data in R was developed developed in Phyloseq R package (more details in the lecture). It provides a container for joint storage of feature/counts table, taxonomy, phylogeny, samples metadata, and may even keep the representative sequences for the OTUs/ASVs present in the dataset.

In this practical we will import QZA files to R using qza_to_phyloseq() function from qiime2R package. We will not use the representative sequences in our analysis, so they will not be added to the Phyloseq object in this example.

An alternative way of importing QZA to Phyloseq could be

  • using QIIME2 functions to export QZA to BIOM, followed by
  • reading BIOM using Phyloseq functions.

However, the Phyloseq and QIIME2 often assume slightly different versions of BIOM specification, which may require some additional and tedious BIOM data tuning. On my experience, qiime2R package is the most robust way of importing QZA to Phyloseq (in 2024).

# Required library
library(qiime2R) 

# Source files: assuming files in "data" sub-folder and 
# using "\\" for Windows path separator
feature_table_qza <- "data\\feature_table.qza"
rooted_tree_qza <- "data\\rooted_tree.qza"
taxonomy_qza <- "data\\taxonomy.qza"
metadata_tsv <- "data\\samples.txt"

# Read data
data.phy <- qza_to_phyloseq(
    features = feature_table_qza,
    tree = rooted_tree_qza,
    taxonomy = taxonomy_qza,
    metadata = metadata_tsv
    )

# Check result
data.phy
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 60355 taxa and 330 samples ]
## sample_data() Sample Data:       [ 330 samples by 11 sample variables ]
## tax_table()   Taxonomy Table:    [ 60355 taxa by 7 taxonomic ranks ]
## phy_tree()    Phylogenetic Tree: [ 60355 tips and 60236 internal nodes ]
# Cleanup
# Here I included removal of not needed objects, packages and cleaning the RAM
# Later I will opnly remuva not needed objects. 
rm(feature_table_qza, rooted_tree_qza, taxonomy_qza, metadata_tsv) # remove unnecessary objects
detach("package:qiime2R", unload=TRUE) # unload qiime2R package
gc() # free unused R memory
##            used  (Mb) gc trigger  (Mb) max used  (Mb)
## Ncells  5229507 279.3   10184331 544.0 10184331 544.0
## Vcells 30170113 230.2   88785406 677.4 92120484 702.9

Exploring Phyloseq object

Phyloseq package provides many helpful functions to explore microbial data.

Summary of Phyloseq object

# Load phyloseq and check its version
library(phyloseq) 
packageVersion("phyloseq")
## [1] '1.44.0'
# High-level content of the Phyloseq object
data.phy 
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 60355 taxa and 330 samples ]
## sample_data() Sample Data:       [ 330 samples by 11 sample variables ]
## tax_table()   Taxonomy Table:    [ 60355 taxa by 7 taxonomic ranks ]
## phy_tree()    Phylogenetic Tree: [ 60355 tips and 60236 internal nodes ]

Explore Metadata

dim(sample_data(data.phy)) 
## [1] 330  11
sample_variables(data.phy)
##  [1] "Novogene_ID"    "Site_code"      "Site_Code_Plot" "Site_name"      "Plot_no"        "Cranfield_code" "Region"         "Former_landuse" "Woodland_age"   "pH"             "Conductivity"
head(sample_names(data.phy))
## [1] "WP001" "WP002" "WP003" "WP004" "WP005" "WP006"
sample_data(data.phy)[1:5,1:5]
##            Novogene_ID Site_code Site_Code_Plot       Site_name Plot_no
## WP001 FKDN230153473-1A      SW01        SW01_P1 Blackridge_Farm       1
## WP002 FKDN230153474-1A      SW01        SW01_P2 Blackridge_Farm       2
## WP003 FKDN230236203-1A      SW01        SW01_P3 Blackridge_Farm       3
## WP004 FKDN230153476-1A      SW01        SW01_P4 Blackridge_Farm       4
## WP005 FKDN230153477-1A      SW01        SW01_P5 Blackridge_Farm       5
# How many sites per Land Use and Region? 
table(sample_data(data.phy)[,c("Region","Former_landuse")])
##           Former_landuse
## Region     Agriculture Ancient_woodland Industrial Rewilding
##   England           75               10         75        10
##   Scotland          75               10         75         0

Explore Feature Table

dim(otu_table(data.phy)) 
## [1] 60355   330
otu_table(data.phy)[1:10,1:5]
## OTU Table:          [10 taxa and 5 samples]
##                      taxa are rows
##                                  WP001 WP002 WP003 WP004 WP005
## 90dd28446d3cd23e7707d9d5c2fb4df8     0     0     0     0     0
## 78be05943c92bdd31e9a872ee909279c     0     0     0     0     0
## d6b5940830252894a33e69c8ff9c1a23     0     0     0     0     0
## d037d048afbc9e092d04fa76c860d23d     0     0     0     0     0
## 94456191991e1095be8219ca16e0413e     0     0     0     0     0
## 741481502fb4f4eeba9211d66e4bbda0     0     0     0     0     0
## 8142f3b14a5160a88689c0568dd975dc     0     0     0     0     0
## 16da85f532caa5d7a53f71f09a518c46     0     0     0     0     0
## d7f4617ac1639732ce98499f296f9c08     0     0     0     0     0
## 2b101b9e356663a2581709b5b0981f3e     0     0     0     0     0
# Extract to matrix, and check sparsity
otu_table.mx <- as.matrix(otu_table(data.phy)) 
sum(otu_table.mx==0)/(nrow(otu_table.mx)*ncol(otu_table.mx)) # sparsity: 98%
## [1] 0.981901
# Clean-up
rm(otu_table.mx)

Explore Taxonomy

# Ranks in the dataset
rank_names(data.phy) 
## [1] "Kingdom" "Phylum"  "Class"   "Order"   "Family"  "Genus"   "Species"
# Taxonomy table
dim(tax_table(data.phy))
## [1] 60355     7
tax_table(data.phy)[1:10,]
## Taxonomy Table:     [10 taxa by 7 taxonomic ranks]:
##                                  Kingdom       Phylum            Class                 Order                  Family                     Genus            Species
## 90dd28446d3cd23e7707d9d5c2fb4df8 "d__Bacteria" "Patescibacteria" "Paceibacteria"       "Paceibacterales"      "Staskawiczbacteraceae"    "GWA2-37-10"     NA     
## 78be05943c92bdd31e9a872ee909279c "Unassigned"  NA                NA                    NA                     NA                         NA               NA     
## d6b5940830252894a33e69c8ff9c1a23 "d__Bacteria" "Patescibacteria" NA                    NA                     NA                         NA               NA     
## d037d048afbc9e092d04fa76c860d23d "d__Bacteria" "Patescibacteria" NA                    NA                     NA                         NA               NA     
## 94456191991e1095be8219ca16e0413e "d__Bacteria" "Patescibacteria" NA                    NA                     NA                         NA               NA     
## 741481502fb4f4eeba9211d66e4bbda0 "d__Bacteria" "Patescibacteria" NA                    NA                     NA                         NA               NA     
## 8142f3b14a5160a88689c0568dd975dc "d__Bacteria" NA                NA                    NA                     NA                         NA               NA     
## 16da85f532caa5d7a53f71f09a518c46 "d__Bacteria" "Patescibacteria" NA                    NA                     NA                         NA               NA     
## d7f4617ac1639732ce98499f296f9c08 "d__Bacteria" "Proteobacteria"  "Alphaproteobacteria" "Rhizobiales_A_504705" "Xanthobacteraceae_503485" "Bradyrhizobium" NA     
## 2b101b9e356663a2581709b5b0981f3e "Unassigned"  NA                NA                    NA                     NA                         NA               NA
# Extract to data frame
taxa.df <- as.data.frame(tax_table(data.phy)) 

# What Kingdoms we have in the dataset (plain R)
table(taxa.df$Kingdom)
## 
##  d__Archaea d__Bacteria  Unassigned 
##         161       60173          21
# Number of different Phyla in the dataset (dplyr)
n_distinct(taxa.df$Phylum)
## [1] 67
# 7 most abundant Bacterial Phyla (dplyr)
taxa.df %>% 
  filter(Kingdom == "d__Bacteria") %>% 
  count(Phylum, sort=T) %>% 
  head(7)
##             Phylum     n
## 1   Proteobacteria 12460
## 2  Planctomycetota  9108
## 3 Actinobacteriota  6492
## 4  Acidobacteriota  5815
## 5             <NA>  5250
## 6    Chloroflexota  3550
## 7     Bacteroidota  3296
# Clean-up
rm(taxa.df)

Explore Phylogeny

# Make a small Phyloseq object (for a simple toy phylogenetic tree)
myTaxa <- taxa_names(data.phy)[200:210] # Select some taxa: you may select others if you wish :)
myTaxa.phy = prune_taxa(myTaxa, data.phy) # Make phyloseq object
myTaxa.phy # Show summary of the Phyloseq object
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 11 taxa and 330 samples ]
## sample_data() Sample Data:       [ 330 samples by 11 sample variables ]
## tax_table()   Taxonomy Table:    [ 11 taxa by 7 taxonomic ranks ]
## phy_tree()    Phylogenetic Tree: [ 11 tips and 10 internal nodes ]
# Plot a phylogenetic tree
plot_tree(myTaxa.phy, label.tips = "Family", 
          color = "Class", size = "abundance",
          plot.margin = 0.5, ladderize = TRUE, 
          title="Example of a phylogeny tree")

# Clean-up
rm(myTaxa, myTaxa.phy)

Manipulating Phyloseq object

After exploring the data, Phyloseq also provides masny functions to manipulate the data: filter, merge, agglomerate etc (by taxa or by samples).

Filtering and merging

Note that in Phyloseq Merging/ Agglomerating is not the same as Filtering/ Sub-setting/ Pruning.

In the chunk below we will agglomerate taxa. A similar merging operation exist for samples (it will be illustrated later, when we make taxonomic barplots).

# Reminder of a summary of the source data
data.phy
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 60355 taxa and 330 samples ]
## sample_data() Sample Data:       [ 330 samples by 11 sample variables ]
## tax_table()   Taxonomy Table:    [ 60355 taxa by 7 taxonomic ranks ]
## phy_tree()    Phylogenetic Tree: [ 60355 tips and 60236 internal nodes ]
# Keep only samples with Industrial or Agricultural prior use
table(sample_data(data.phy)$Former_landuse)
## 
##      Agriculture Ancient_woodland       Industrial        Rewilding 
##              150               20              150               10
IndAgri.phy <- subset_samples(data.phy, 
                              Former_landuse %in% c("Industrial","Agriculture"))
IndAgri.phy # Number of samples dropped from 330 to 300
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 60355 taxa and 300 samples ]
## sample_data() Sample Data:       [ 300 samples by 11 sample variables ]
## tax_table()   Taxonomy Table:    [ 60355 taxa by 7 taxonomic ranks ]
## phy_tree()    Phylogenetic Tree: [ 60355 tips and 60236 internal nodes ]
# Reminder: what kingdoms we have in the data?
table(as.data.frame(tax_table(data.phy))$Kingdom)
## 
##  d__Archaea d__Bacteria  Unassigned 
##         161       60173          21
# Keep only Bacteria (remove Archaea and Unclassified)
BacIndAgri.phy <- subset_taxa(IndAgri.phy, Kingdom == "d__Bacteria")
BacIndAgri.phy # The OTU count dropped from 60,355 to 60,173
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 60173 taxa and 300 samples ]
## sample_data() Sample Data:       [ 300 samples by 11 sample variables ]
## tax_table()   Taxonomy Table:    [ 60173 taxa by 7 taxonomic ranks ]
## phy_tree()    Phylogenetic Tree: [ 60173 tips and 60054 internal nodes ]
# Agglomerate Bacteria to Class level: to reduce the size of dataset for analysis
# (may take ~5min)
BacIndAgriClass.phy <- tax_glom(BacIndAgri.phy, taxrank="Class")
BacIndAgriClass.phy # Taxa count dropped to 156
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 156 taxa and 300 samples ]
## sample_data() Sample Data:       [ 300 samples by 11 sample variables ]
## tax_table()   Taxonomy Table:    [ 156 taxa by 7 taxonomic ranks ]
## phy_tree()    Phylogenetic Tree: [ 156 tips and 155 internal nodes ]
# Note that all ranks are preserved, but the data set to NA 
rank_names(BacIndAgriClass.phy) 
## [1] "Kingdom" "Phylum"  "Class"   "Order"   "Family"  "Genus"   "Species"
head(tax_table(BacIndAgriClass.phy))
## Taxonomy Table:     [6 taxa by 7 taxonomic ranks]:
##                                  Kingdom       Phylum               Class                    Order Family Genus Species
## 7b973191300cc536ed0e864d139df732 "d__Bacteria" "Desulfobacterota_D" "UBA1144"                NA    NA     NA    NA     
## 92e4bb6e35dc20054c1d2c8f3d3b0439 "d__Bacteria" "Desulfobacterota_I" "Desulfovibrionia"       NA    NA     NA    NA     
## 2c372459c9d3749b757c83aa4ccae93c "d__Bacteria" "Desulfobacterota_I" "Syntrophobacteria"      NA    NA     NA    NA     
## 93d4a957de027db558e39b26776cab7b "d__Bacteria" "Desulfobacterota_C" "Deferrisomatia"         NA    NA     NA    NA     
## 20a5bdc4157cf62cae00ba7af03ef579 "d__Bacteria" "Verrucomicrobiota"  "Kiritimatiellae_777934" NA    NA     NA    NA     
## 67cd63792e78e5c0624f90683f990e78 "d__Bacteria" "Verrucomicrobiota"  "Verrucomicrobiae"       NA    NA     NA    NA
# Keep only taxa (classes) with counts >10 in at least 12 samples
# x in filter_taxa() stands for the rows of the feature table 
# (rows = taxa, columns = samples)
BacIndAgriClassAbund.phy <- filter_taxa(BacIndAgriClass.phy,
                                        function(x) sum(x >= 10) >= 12,
                                        prune=TRUE)
BacIndAgriClassAbund.phy # Taxa count dropped to 76
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 76 taxa and 300 samples ]
## sample_data() Sample Data:       [ 300 samples by 11 sample variables ]
## tax_table()   Taxonomy Table:    [ 76 taxa by 7 taxonomic ranks ]
## phy_tree()    Phylogenetic Tree: [ 76 tips and 75 internal nodes ]
# Explore sparsity after agglomeration
otu_table.mx <- as.matrix(otu_table(BacIndAgriClassAbund.phy)) # extract matrix
sum(otu_table.mx==0)/(nrow(otu_table.mx)*ncol(otu_table.mx)) # sparsity: 34%
## [1] 0.3372807
# Clean-up
rm(data.phy, IndAgri.phy, BacIndAgri.phy, BacIndAgriClass.phy, otu_table.mx)

Normalization

Phyloseq function transform_sample_counts() allows applying a function over samples. The function takes the x argument, which refers to a column within the feature table (remember: taxa= rows, samples=columns).

transform_sample_counts() may be used for normalization, such as scaling or centering per sample.

For simplicity, in this tutorial we will normalize by total count per sample. Then we scale the fractions by the median, and round to integers because some of the downstream functions expect integer counts, rather than fractions.

An alternative commonly used way of normalizing microbial data is rarefaction: equalizing depth in samples by random removal of data from the samples with higher depth. In R, the rarefaction curves could be plottred using vegan::rarecurve(), then the rarefaction may be applied using phyloseq::rarefy_even_depth().

It’s important to remember that after combining samples or filtering taxa the normalization should be repeated.

# Counts per sample before normalization
plot(colSums(otu_table(BacIndAgriClassAbund.phy)), 
     ylab="Count", xlab="Samples", xaxt="n", 
     main="Counts per sample before normalization")

# Calculate median count per sample before normalization (33,183)
countsPerSample <- colSums(otu_table(BacIndAgriClassAbund.phy))
medianCount <- median(countsPerSample)

# Normalize by total count per sample
BacIndAgriClassAbundNorm.phy  = transform_sample_counts(BacIndAgriClassAbund.phy, 
  function(x) x / sum(x))

# Scale and round (to have integer counts instead of fractions)
BacIndAgriClassAbundNorm.phy  = transform_sample_counts(BacIndAgriClassAbundNorm.phy, 
  function(x) round(x*medianCount))

# Counts per sample after normalization
plot(colSums(otu_table(BacIndAgriClassAbundNorm.phy)), 
     ylab="Count", xlab="Samples", xaxt="n", 
     main="Counts per sample after normalization",
     ylim=c(min(countsPerSample),max(countsPerSample)))

# Clean-up
rm(BacIndAgriClassAbund.phy, countsPerSample, medianCount)

Visualising and analysis

Taxonoimic composition

Visualizing taxonomy is an important part of exploring microbial data.

Simple bar plots

Phyloseq provides plot_bar() function that outputs a ggplot object that can be directly piped to the geom_bar() layer.

# Select 10 first samples (for making the toy plots clearer)
MySamples <- sample_names(BacIndAgriClassAbundNorm.phy)[1:10]
MyData.phy <- prune_samples(MySamples, BacIndAgriClassAbundNorm.phy)
MyData.phy
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 76 taxa and 10 samples ]
## sample_data() Sample Data:       [ 10 samples by 11 sample variables ]
## tax_table()   Taxonomy Table:    [ 76 taxa by 7 taxonomic ranks ]
## phy_tree()    Phylogenetic Tree: [ 76 tips and 75 internal nodes ]
# Make taxonomy barplot
plot_bar(MyData.phy, fill = "Phylum") +
  geom_bar(aes(color=Phylum, fill=Phylum), stat="identity", position="stack") + 
  theme(legend.position = "bottom") + 
  ggtitle("Taxonomy (Phyla) per sample") +
  guides(fill=guide_legend(ncol=3))

# Clean-up
rm(MySamples)

Combining low-abundant taxa

Combining low-abundant taxa may make the barplot clearer.

Unfortunately, plot_bar() does not provide such option. The code below illustrates how it can be done (modified from the Phyloseq author post https://github.com/joey711/phyloseq/issues/494 ).

It includes 3 steps:

  • Converting Phyloseq object to a large data frame using psmelt() function
  • Substituting rare phyla to “others” (e.g. Phyla with count <1%)
  • Using the updated data frame for the ggplot stacked barplot
# Convert Phyloseq object to large data frame
MeltData.df <- psmelt(MyData.phy)
MeltData.df[1:5,1:12]
##                                  OTU Sample Abundance      Novogene_ID Site_code Site_Code_Plot       Site_name Plot_no Cranfield_code   Region Former_landuse Woodland_age
## 733 f898c7a8c566d0beb46d5448dde2387c  WP003     11758 FKDN230236203-1A      SW01        SW01_P3 Blackridge_Farm       3         CU0003 Scotland    Agriculture           18
## 739 f898c7a8c566d0beb46d5448dde2387c  WP007     10093 FKDN230284296-1A      SW02        SW02_P2       Cassafuir       2         CU0007 Scotland    Agriculture           50
## 738 f898c7a8c566d0beb46d5448dde2387c  WP009      9179 FKDN230153481-1A      SW02        SW02_P4       Cassafuir       4         CU0009 Scotland    Agriculture           50
## 732 f898c7a8c566d0beb46d5448dde2387c  WP006      8565 FKDN230153478-1A      SW02        SW02_P1       Cassafuir       1         CU0006 Scotland    Agriculture           50
## 441 8768efd36962598377bd60d7c6d576db  WP009      8492 FKDN230153481-1A      SW02        SW02_P4       Cassafuir       4         CU0009 Scotland    Agriculture           50
# Get counts per phylum
Phyla.df <- MeltData.df %>% 
  group_by(Phylum) %>% 
  summarise(Count=sum(Abundance))

Phyla.df
## # A tibble: 37 × 2
##    Phylum             Count
##    <chr>              <dbl>
##  1 Acidobacteriota    72806
##  2 Actinobacteriota   26413
##  3 Armatimonadota        86
##  4 Bacteroidota        9683
##  5 Bdellovibrionota_E   457
##  6 Chlamydiota          432
##  7 Chloroflexota       5695
##  8 Cyanobacteria         67
##  9 Dependentiae         134
## 10 Desulfobacterota_B  4412
## # ℹ 27 more rows
# Select the cut-off (e.g. 1% of total count)
cutoff <- 0.01 * sum(MeltData.df$Abundance)

# Select low-abundant Phyla (with total counts below the cutoff)
lowAbundant <- Phyla.df[Phyla.df$Count <= cutoff,]$Phylum

# Substitute Phylum names to "<1%" for the low-abundant phyla
MeltData.df[MeltData.df$Phylum %in% lowAbundant,]$Phylum <- '<1%'

# Make stacked barplot 
ggplot(MeltData.df, aes(x=Sample, y=Abundance, fill=Phylum)) +
  geom_bar(aes(), stat="identity", position="stack") +
  theme(legend.position="bottom") + 
  ggtitle("Taxonomy (Phyla) per sample") +
  guides(fill=guide_legend(nrow=3)) + 
  theme(axis.text.x = element_text(angle = 90)) +
  labs(y= "Relative abundance", x = NULL) +
  theme(legend.position = "bottom") + 
  guides(fill=guide_legend(ncol=3))

# Clean-up
rm(MyData.phy, MeltData.df, Phyla.df, cutoff, lowAbundant)

Combining samples

Phyloseq function merge_samples() allows to combine samples, so we can compare groups of samples. In this example we will combine soil samples by the Prior Land Use.

Important: after merging samples, the data should be re-normalised!

# Merge samples in groups (warnings are OK) 
landUse.phy <- merge_samples(BacIndAgriClassAbundNorm.phy, "Former_landuse")
landUse.phy
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 76 taxa and 2 samples ]
## sample_data() Sample Data:       [ 2 samples by 11 sample variables ]
## tax_table()   Taxonomy Table:    [ 76 taxa by 7 taxonomic ranks ]
## phy_tree()    Phylogenetic Tree: [ 76 tips and 75 internal nodes ]
# Re-Normalize after merging samples !
landUse.phy  = transform_sample_counts(landUse.phy, function(x) x / sum(x))

# Plot relative taxonomic abundances
plot_bar(landUse.phy, fill = "Phylum") + 
  geom_bar(aes(color=Phylum, fill=Phylum), stat="identity", position="stack") + 
  theme(legend.position = "bottom") +
    ggtitle("Taxonomy (Phyla) per land use") +
  guides(fill=guide_legend(ncol=3))

# Clean-up
rm(landUse.phy)

Optional task: Make barplot combining low-abundant taxa from the landUse.phy

Other taxonomy plots

Combining Phyloseq and ggplot2 allow making many other taxonomy plots. Below is just a couple of examples: standard (non-stacked) barplots and heatmaps.

Non-stacked barplots

# Select a Phylum (e.g. Acidobacteriota)
acidobacteriota.phy <- subset_taxa(BacIndAgriClassAbundNorm.phy, 
                                   Phylum == "Acidobacteriota")
acidobacteriota.phy
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 8 taxa and 300 samples ]
## sample_data() Sample Data:       [ 300 samples by 11 sample variables ]
## tax_table()   Taxonomy Table:    [ 8 taxa by 7 taxonomic ranks ]
## phy_tree()    Phylogenetic Tree: [ 8 tips and 7 internal nodes ]
# Plot relative abundance of classes within this Phylum
plot_bar(acidobacteriota.phy, x="Class", fill = "Class") +
  geom_bar(aes(color=Class, fill=Class), stat="identity", position="stack") +
  ggtitle("Classes in Acidobacteriota Phylum") + 
  theme(legend.position = "none")

# Clean-up
rm(acidobacteriota.phy)

Heatmaps

See ?distanceMethodList for available distances.
See ?ordinate about available ordination methods.

plot_heatmap(BacIndAgriClassAbundNorm.phy, 
             method = "PCoA", distance = "unifrac",
             taxa.label = "Class", taxa.order = "Phylum", 
             na.value="white", low="beige", high="red", 
             title="Heatmap", max.label = 100)

It could be noted that R has a wide range of libraries for plotting hetamps, e.g. see some examples here: https://www.datanovia.com/en/lessons/heatmap-in-r-static-and-interactive-visualization/
You dont have to use Phyloseq::plot_heatmap() function to make heatmaps with microbial data. You may use Phyloseq just to prepare and extract the feature table, and then apply any other R heatmap package of your choice.

Alpha diversity

Alpha diversity describes the richness and evenness of microbial community within a single sample.

Plots

There are many measures of Alpha diversity. Here we plot Chao1, Shannon and Simpson indices. Use ?plot_richness to see what other measures are available in Phyloseq.

plot_richness(BacIndAgriClassAbundNorm.phy, 
              measures=c("Chao1", "Shannon", "Simpson"),
              color="Former_landuse", x="Former_landuse", 
              title="Alpha diversity vs Former Land Use") + 
  geom_boxplot()

Estimating statistical significance

In the plot above you may see the trend for a higher Alpha diversity in Post-Industrial than in Post-Agricultural soils. Was this difference significant?
Because of the compositionality (relative nature) and high sparsity of the standard statistical tests assuming normality of distributions usually are not appropriate for microbial data. Instead, it is recommended to use non-parametric or specialized and permutation-based tests.

In this example we only compare two groups: Agriculture vs Industrial land use. So, we just apply Wilcoxon Rank Sum test, which is appropriate for comparing two groups. If there were more than two groups to compare (e.g. if we kept the Rewilding and Ancient Woodland samples) the pairwise.wilcox.test() or pairwise Kruskall-Wallis test could be used including the appropriate adjustments for multiple testing.

Importantly: these tests are implemented outside of Phyloseq package.

# Calculate Alpha diversity indices
alpha_richness = estimate_richness(
  BacIndAgriClassAbundNorm.phy, measures = c("Chao1", "Shannon", "Simpson"))

head(alpha_richness)
##       Chao1 se.chao1  Shannon   Simpson
## WP001    50        0 2.344960 0.8547996
## WP002    49        0 2.392025 0.8640416
## WP003    42        0 2.420025 0.8390369
## WP004    55        0 2.682299 0.8971910
## WP005    60        0 2.712531 0.8952492
## WP006    47        0 2.307757 0.8426350
# Apply Wilcoxon test 
wilcox.test(alpha_richness$Shannon~
              sample_data(BacIndAgriClassAbundNorm.phy)$Former_landuse)
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  alpha_richness$Shannon by sample_data(BacIndAgriClassAbundNorm.phy)$Former_landuse
## W = 8142, p-value = 3.527e-05
## alternative hypothesis: true location shift is not equal to 0
# Clean-up
rm(alpha_richness)

Is there a significant difference in soil microbial Alpha diversity depending on the prior land use in the RestREco dataset?

Unconstrained ordination

Term “ordination” term in ecology analysis is used for dimentionality reduction and visualization (e.g. techniques similar to PCA). The idea is to make a 2D (or 3D) image that would reflect similarity between samples (or taxa), in such a way that similar samples (or taxa) are located close on the image; while the dissimilar samples (or taxa) are drown far apart.

Calculating unconstrained ordination

Technically, an (unconstrained) ordination procedure includes two main steps:

  • calculating distances between samples, and
  • making a plot to reflect these distances in 2D image in a best possible way.

In the example below we use Bray-Curtis distance and PCoA method for making the plot. See help to the ordinate() function (?ordinate) to find information about other available distances and methods.

In this practical we will consider the unconstrained ordination only. The constrained ordination includes an additional step that allows to add some environmental gradients (e.g. soil pH) to the plot: so we can see how the samples (or taxa) are positioned along these gradients.

# Calculating the ordination (will be used later for plotting)
ord <- ordinate(BacIndAgriClassAbundNorm.phy, 
                distance = "bray",
                method = "PCoA")

Taxa ordination plot

On this plot, the taxa located close to each other are frequently found in the same samples. In my experience, plot_ordination() doesn’t handle the colours well, so in the chunk below I manage the colours manually.

# --- Make colours for Phyla --- #

# Get names of Phyla present in the dataset
phyla <- unique(data.frame(tax_table(BacIndAgriClassAbundNorm.phy))$Phylum)

# Make a colour for each phyla 
library(Polychrome)
phyla_cols <- createPalette(length(phyla), c("#0000FF", "#FF0000"))
names(phyla_cols) <- phyla
#phyla_cols
#swatch(phyla_cols)

# --- Make Plot --- #
plot_ordination(BacIndAgriClassAbundNorm.phy, ord, type="taxa",
                color="Phylum", label="Class") + 
  ggtitle("Taxa ordination: Classes", 
          subtitle="Coloured by Phyla, PCoA with Bray-Curtis distances")+ 
  geom_point(size=3) + 
  scale_color_manual(values = phyla_cols) +
  theme_bw() +
  theme(legend.position = "bottom") +
  guides(color=guide_legend(ncol=3))

# Clean-up
rm(phyla, phyla_cols)

Beta diversity

Plotting with Phyloseq

Beta diversity, in effect, is an unconstrained ordination of samples. Phyloseq allows to plot it using the same plot_ordination() function that has been used above for plotting taxa. Of course, for making the ordination plot for samples you should set type=“samples”. The samples located closely on the plot have similar microbial communities, the samples located far apart have very distinct microbial communities.

plot_ordination(BacIndAgriClassAbundNorm.phy, ord, type="samples",
                color="Former_landuse") + geom_point(size=3) +
  stat_ellipse() +
  ggtitle("Beta diversity", 
          subtitle="Unconstrained ordination of samples, PCoA with Bray-Curtis distances")+ 
  theme_bw()

# Clean-up
rm(ord)

Judging by this Beta diversity plot: are Post-Industrial and Post-Agricultural microbial communities different?

Hierarchical clustering

Beta diversity may also be visualized using hierarchical clustering dendrograms. The code chunk below illustrates how to plot such deprograms using plain R methods (with minimal use of Phyloseq functions).

Hierarchical clustering

# Calculate distance matrix (using phyloseq)
bray_dist <- distance(BacIndAgriClassAbundNorm.phy, method="bray")
as.matrix(bray_dist)[1:5,1:5]
##            WP001      WP002     WP003      WP004      WP005
## WP001 0.00000000 0.09555522 0.2874471 0.21405539 0.18252772
## WP002 0.09555522 0.00000000 0.2352294 0.15354710 0.16339692
## WP003 0.28744707 0.23522941 0.0000000 0.18898148 0.19497310
## WP004 0.21405539 0.15354710 0.1889815 0.00000000 0.07829782
## WP005 0.18252772 0.16339692 0.1949731 0.07829782 0.00000000
# Perform hierarchical clustering (outside phyloseq)
bray_clust <- hclust(bray_dist, method="average")

Plot dendrogram

The plotting uses R functions outside of Phyloseq.

# For colouring dendrograms
library(dendextend) 

# Extract metadata from Phyloseq object
metadata <- data.frame(sample_data(BacIndAgriClassAbundNorm.phy))

# Make the dendrogram object
bray_dend <- as.dendrogram(bray_clust, hang=0.1)

# Get samples ordered as in the deprogram
dend_samples <- get_leaves_attr(bray_dend, "label")

# Prepare colours foe samples (by former land use)
landuse_colours <- c('green','red')[
  match(metadata$Former_landuse, c('Agriculture','Industrial'))]
names(landuse_colours) <- rownames(metadata)
landuse_colours <- landuse_colours[dend_samples]

# Set colours of terminal branches (h=1)
bray_dend <- color_branches(bray_dend, 
                            k=length(dend_samples), h=1,  
                            landuse_colours)

# Draw plot
plot(bray_dend, main="Soil amples (coloured by former land use)", 
     ylab="Distances", leaflab = "none")

# Add legend
legend("topright",
       legend=c('Agriculture','Industrial'),
       col=c('green','red'), 
       bty="n",lty=1, cex=0.8)

# Clean-up
rm(bray_dend, dend_samples, bray_clust, landuse_colours)

On the plot above you may note that, overall, the samples are grouped into separate clusters, with Agriculture and Industrial samples not mixing much with each other. However this is clearly less illustrative than the samples separation shown on 3D visualization plots in the lecture.

Statistical testing with Vegan::Adonis2()

The 3D, 2D plots, and even the dendrogram above suggested that, on average, the soil microbial communities may differ between Post-Agricultural and Post-Industrial samples. As it was noted earlier, the microbial data requires specialized tests, which are implemented outside of the Phyloseq package. The most common test used for estimating statistical significance for Beta diversity is PERMANOVA (PERmutation based Multivariate ANOVA) as implemented in adonis2() function from the vegan package.

Vegan package is as popular as Phyloseq in analysis of ecology data. However, Vegan includes more statistical tests, and puts less focus on ggplot2-compartible visualization or data manipulation (see https://vegandevs.github.io/vegan )

In the example below you may note this formula within the vegan::adonis2() function:
~Region/Former_landuse/Site_code.
This is a way of describing multi-layer nested designs in R.

Also, you may note the strata=Region term: it does not allow permutations between regions.

More detailed discussion of restricted permutations and nested designs is beyond the scope of this introductory course (although some references are given below).

library(vegan)

adonis2(bray_dist ~ Region/Former_landuse/Site_code,
        strata = metadata$Region,
        data = metadata, permutations=1000)
## Permutation test for adonis under reduced model
## Terms added sequentially (first to last)
## Blocks:  strata 
## Permutation: free
## Number of permutations: 1000
## 
## adonis2(formula = bray_dist ~ Region/Former_landuse/Site_code, data = metadata, permutations = 1000, strata = metadata$Region)
##                                  Df SumOfSqs      R2        F   Pr(>F)    
## Region                            1   2.4290 0.21067 274.8635 0.000999 ***
## Region:Former_landuse             2   2.4437 0.21194 138.2635 0.000999 ***
## Region:Former_landuse:Site_code  56   4.5363 0.39344   9.1665 0.000999 ***
## Residual                        240   2.1209 0.18395                      
## Total                           299  11.5298 1.00000                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
rm(metadata, bray_dist)

This result of PERMANOVA test shows that all three studied factors have significant effect on the microbial community: the Region, the Prior Land Use, and the Sites within each Region/ Land Use group. The R2 column shows the relative contributions of each factor into the total variance of the microbial communities.

Additional optional reading about nested designs and restricted permutations in R

Refresher about nested designs

https://www.nature.com/articles/nmeth.3137

About R formulas for nested designs

https://stat.ethz.ch/R-manual/R-devel/library/stats/html/formula.html
https://www.jstor.org/stable/2346786

Breifly:
If factor B is nested within factor A (e.g. we have multiple sites within each type of the land use), then the typical ANOVA (or regression) R formula could be written in the follwoing way:
A + B %in% A This means that we are interested in effect of factor A abd factor B nested within A.
An equivalent way of writing this formula is
A + A:B
For R experts, it highlights that, in fact, we a looking for effects of A and interaction of A with B.
The syntax A/B (used in our code) is a shortcut for A + B %in% A

About vegan::adonis2() and restricted permutations

https://uw.pressbooks.pub/appliedmultivariatestatistics/chapter/permanova/
https://uw.pressbooks.pub/appliedmultivariatestatistics/chapter/restricting-permutations/

Network plots

In addition to the plots illustrating Taxonomy, Phylogeny, Alpha and Beta diversity, Phyloseq may also draw the network plots. In the network plots the nodes (that could be samples or taxa) are connected by edges. The edges thickness indicates the degree of similarity.

Importantly, the location of the nodes in the network plot does not always reflect their similarity. While many algorithms attempt to place similar nodes close, the other options are also frequently used. In phyloseq::plot_net() function, the location of nodes is defined by the “laymeth” parameter, which may take the following values: “auto”, “random”, “circle”, “sphere”, “fruchterman.reingold”, “kamada.kawai”, “spring”, “reingold.tilford”, “fruchterman.reingold.grid”, “lgl”, “graphopt” and “svd”.

As usual, Phyloseq only provides the plotting functions. Further quantitative network analysis, like assessment of connectivity, centrality, etc would require other specialized tools.

Samples network

The plot for the samples network below suggests that soil replicas from the same sites were similar to each other.

plot_net(BacIndAgriClassAbundNorm.phy, type = "samples", 
         distance = "bray", maxdist = 0.1, 
         color="Site_code", point_label="Site_code") +
  ggtitle("Samples (colored by Sites)")

Taxa networks

The taxa co-abundances networks should be interpreted with caution: the connections between taxa on the plot should NOT be taken as an evidence of biological or physical interaction.

# Taxa plot with default layout of nodes
plot_net(BacIndAgriClassAbundNorm.phy, type = "taxa", 
         distance = "(A+B-2*J)/(A+B)", maxdist = 0.1, 
         color="Phylum", point_label="Class") +
  ggtitle("Phyla (default layout)")

# Taxa plot with circular location of nodes
plot_net(BacIndAgriClassAbundNorm.phy, type = "taxa", 
         distance = "(A+B-2*J)/(A+B)", maxdist = 0.1, laymeth="circle", 
         color="Phylum", point_label="Class") +
  ggtitle("Phyla (circular layout)")

Differential abundance with ancombc2

ANCOMBC package implements a method for differential abundance analysis, which takes into account the compositionality of microbial data: https://www.nature.com/articles/s41467-020-17041-7

Like Vegan and Phyloseq, ANCOMBC is a very popular package, and it can directly accept the Phyloseq objects.

# Lad package
library(ANCOMBC)

# Estimate differential abundances
# ~5min, multiple warnings are OK
ancom.output = ancombc2(data = BacIndAgriClassAbundNorm.phy, tax_level = "Class",
                fix_formula = "Former_landuse", 
                group = "Former_landuse", 
                p_adj_method = "holm", alpha = 0.05, pairwise = TRUE,
                prv_cut = 0.10, lib_cut = 1000, 
                n_cl = 4, verbose = TRUE)

# Clean-up
detach("package:ANCOMBC", unload=TRUE) # unload ANCOMBC package

Extract results from ancombc2 output

The ancombc2 output contains many components that we don’t need for the downstream analysis. We only need the table with results. Moreover, within this table we will only use selected columns describing Former Landuse (ignoring the data about intercept).

The columns that we need are:

  • taxon: the taxa (at Class level in our case)
  • lfc_Former_landuseIndustrial: Log Fold Change (Industrial vs Agriculture)
  • se_Former_landuseIndustrial: Standard error for LFC
  • W_Former_landuseIndustrial: The statistics calculated by ANCOMBC
  • p_Former_landuseIndustrial: p-value for the statistics (w/o the multiple testing correction)
  • q_Former_landuseIndustrial: significance after the multiple testing correction
  • diff_Former_landuseIndustrial: Is this taxon differentially abundant ? (TRUE/FALSE)
  • passed_ss_Former_landuseIndustrial: Has the taxon passed the additional ANCOMBC check? (TRUE/FALSE)
# Extract the results table
results.df <- ancom.output$res
colnames(results.df)
##  [1] "taxon"                              "lfc_(Intercept)"                    "lfc_Former_landuseIndustrial"       "se_(Intercept)"                     "se_Former_landuseIndustrial"        "W_(Intercept)"                      "W_Former_landuseIndustrial"         "p_(Intercept)"                      "p_Former_landuseIndustrial"         "q_(Intercept)"                      "q_Former_landuseIndustrial"         "diff_(Intercept)"                   "diff_Former_landuseIndustrial"      "passed_ss_(Intercept)"              "passed_ss_Former_landuseIndustrial"
dim(results.df) # Taxa at Class level
## [1] 76 15
# Select only columns that we need 
all_results.df <- results.df %>% 
  select(taxon, contains("Former_landuse"))
colnames(all_results.df)
## [1] "taxon"                              "lfc_Former_landuseIndustrial"       "se_Former_landuseIndustrial"        "W_Former_landuseIndustrial"         "p_Former_landuseIndustrial"         "q_Former_landuseIndustrial"         "diff_Former_landuseIndustrial"      "passed_ss_Former_landuseIndustrial"
dim(all_results.df)
## [1] 76  8
head(all_results.df)
##              taxon lfc_Former_landuseIndustrial se_Former_landuseIndustrial W_Former_landuseIndustrial p_Former_landuseIndustrial q_Former_landuseIndustrial diff_Former_landuseIndustrial passed_ss_Former_landuseIndustrial
## 1 Verrucomicrobiae                 -0.835731485                  0.08234398               -10.14927295               5.545994e-21               4.159495e-19                          TRUE                               TRUE
## 2             Mor1                  0.221208208                  0.10531749                 2.10039382               3.682299e-02               1.000000e+00                         FALSE                               TRUE
## 3    Phycisphaerae                  0.055095876                  0.09080649                 0.60673940               5.444858e-01               1.000000e+00                         FALSE                               TRUE
## 4   Planctomycetia                 -0.473222220                  0.08363992                -5.65785098               3.597050e-08               2.374053e-06                          TRUE                               TRUE
## 5           B15-G4                 -0.077196631                  0.09856533                -0.78320270               4.342676e-01               1.000000e+00                         FALSE                               TRUE
## 6          UBA8108                  0.003872749                  0.07839781                 0.04939868               9.607713e-01               1.000000e+00                         FALSE                              FALSE
# Clean-up
rm(results.df, ancom.output)

Plot differentially abundant taxa

Only the taxa that passed the additional ANCOMBC check should be considered as differentially abundant.

# Select differentially abundant taxa
dif_results.df <- all_results.df %>% 
  filter(diff_Former_landuseIndustrial & passed_ss_Former_landuseIndustrial) %>% 
  arrange(desc(lfc_Former_landuseIndustrial))
dim(dif_results.df) # 18 differentially abundant taxa
## [1] 18  8
dif_results.df
##                  taxon lfc_Former_landuseIndustrial se_Former_landuseIndustrial W_Former_landuseIndustrial p_Former_landuseIndustrial q_Former_landuseIndustrial diff_Former_landuseIndustrial passed_ss_Former_landuseIndustrial
## 1       Blastocatellia                    0.8541556                  0.11216210                   7.615368               4.024883e-13               2.857667e-11                          TRUE                               TRUE
## 2         Chloroflexia                    0.6242052                  0.12587898                   4.958772               1.197188e-06               7.063409e-05                          TRUE                               TRUE
## 3              UBA9160                    0.5836522                  0.13531146                   4.313398               2.432357e-05               1.313473e-03                          TRUE                               TRUE
## 4       Limnocylindria                    0.3712548                  0.08937539                   4.153882               4.273923e-05               2.265179e-03                          TRUE                               TRUE
## 5  Gammaproteobacteria                    0.2580406                  0.06908612                   3.735057               2.248424e-04               1.169180e-02                          TRUE                               TRUE
## 6   Desulfitobacteriia                   -0.3413243                  0.09549417                  -3.574294               4.514943e-04               2.257471e-02                          TRUE                               TRUE
## 7  Alphaproteobacteria                   -0.3712055                  0.07300500                  -5.084658               6.520492e-07               4.042705e-05                          TRUE                               TRUE
## 8       Cyanobacteriia                   -0.3921831                  0.08413241                  -4.661498               8.651602e-06               4.844897e-04                          TRUE                               TRUE
## 9       Planctomycetia                   -0.4732222                  0.08363992                  -5.657851               3.597050e-08               2.374053e-06                          TRUE                               TRUE
## 10         Nitrospiria                   -0.5092767                  0.10193599                  -4.996044               9.989573e-07               5.993744e-05                          TRUE                               TRUE
## 11          Holophagae                   -0.5703991                  0.09434919                  -6.045618               1.263810e-08               8.467527e-07                          TRUE                               TRUE
## 12    Polyangia_437813                   -0.5730558                  0.09543608                  -6.004603               8.778190e-09               5.969169e-07                          TRUE                               TRUE
## 13       Negativicutes                   -0.5822849                  0.09555270                  -6.093861               4.734435e-09               3.266760e-07                          TRUE                               TRUE
## 14      Acidobacteriae                   -0.6519240                  0.12569485                  -5.186561               3.963850e-07               2.533303e-05                          TRUE                               TRUE
## 15    Verrucomicrobiae                   -0.8357315                  0.08234398                 -10.149273               5.545994e-21               4.159495e-19                          TRUE                               TRUE
## 16       Rubrobacteria                   -0.9065683                  0.08133051                 -11.146718               1.152850e-14               8.300523e-13                          TRUE                               TRUE
## 17   Clostridia_258483                   -1.1044127                  0.12053246                  -9.162783               8.569081e-18               6.341120e-16                          TRUE                               TRUE
## 18             Bacilli                   -1.7215222                  0.10558230                 -16.305026               3.635168e-43               2.762728e-41                          TRUE                               TRUE
# Add to the data frame the column with direction of change
dif_results.df = dif_results.df %>%
    mutate(direct = ifelse(lfc_Former_landuseIndustrial > 0, "Positive LFC", "Negative LFC"))

# Update factors levels (may affect plot's layout)
dif_results.df$direct = 
  factor(dif_results.df$direct, levels = c("Positive LFC", "Negative LFC"))
dif_results.df$taxon = 
  factor(dif_results.df$taxon, levels = dif_results.df$taxon)
  
# Make plot
dif_results.df %>%
    ggplot(aes(x = taxon, y = lfc_Former_landuseIndustrial, fill = direct)) + 
    geom_bar(stat = "identity", width = 0.7, color = "black", 
             position = position_dodge(width = 0.4)) +
    geom_errorbar(aes(ymin = lfc_Former_landuseIndustrial - se_Former_landuseIndustrial, 
                      ymax = lfc_Former_landuseIndustrial + se_Former_landuseIndustrial), 
                  width = 0.2, position = position_dodge(0.05), 
                  color = "black") + 
    labs(x = "Class", y = "Log fold change") +  
    ggtitle(label = "Differentially abundant taxa", 
            subtitle="Prior land use: Industrial vs Agricultural") +
    scale_fill_discrete(name = NULL) +
    scale_color_discrete(name = NULL) +
    theme_bw() + 
    theme(panel.grid.minor.y = element_blank(),
          axis.text.x = element_text(size = 6, angle = 60, hjust = 1))

# Clean-up
rm(all_results.df)

A number of microbial classes showed significantly different abundance between Post-Industrial and Post-Agricultural soil samples. This result confirms the previous Beta diversity findings suggesting that the prior lend use has a strong effect on soil microbial community in RestREco dataset.

Microbial biomarker for prior land use

Now we will build a model to detect the prior land use from the microbial community composition.

During this course you have already learned different ML algorithms that could be used for such classification task. For instance, you could use the differentially abundant taxa in the logistic regression (because we only have two classes: Agricultural and Industrial use). However, my personal opinion is that Random Forest (RF) might perform better for this dataset. In real life, you would build models using several different techniques, and compare their performance.

Because RF can detect the features that best perform for classification, we will use the entire dataset, rather then the differentially abundant taxa only.

Required libraries

library(ggplotify)
library(ggpubr)
library(mixOmics)
library(caret)
library(randomForest)
library(parallelMap)

Prepare data

Extract data from Phyloseq object

counts.df <- data.frame(otu_table(BacIndAgriClassAbundNorm.phy))
samples.df <- data.frame(sample_data(BacIndAgriClassAbundNorm.phy))
taxa.df <- data.frame(tax_table(BacIndAgriClassAbundNorm.phy))

Reshape data

Use taxa as row names for counts

# Confirm that row names in taxa and counts are the same
all(rownames(counts.df) == rownames(taxa.df))
## [1] TRUE
# Assign new rownames to counts
rownames(counts.df) <- taxa.df$Class

# Clean-up
rm(taxa.df)

Transpose counts

The variables (taxa) should be the columns and the samples should be the rows for the downstream analysis.

counts.df <- t(counts.df)
counts.df <- as.data.frame(counts.df)

Add class column to the counts

An additional column with class should be added to counts for the downstream analysis.

# Confirm that row names in samples and (transposed) counts are the same
all(rownames(counts.df) == rownames(samples.df))
## [1] TRUE
# Add the column with class to counts data frame
data <- data.frame(class=samples.df$Former_landuse,counts.df)

# Clean-up
rm(samples.df, counts.df)

Train model

Make Training and Validation sets

Caret package provides convenience functions to split dataset into training and validation sub-sets, while preserving the proportions of the classes.

# count number of samples in each class
table(data$class)
## 
## Agriculture  Industrial 
##         150         150
# Make index for splitting (60% training set, 40% test set)
set.seed(42)
trainIndex <- createDataPartition(data$class, p = .6, 
                                  list = FALSE, 
                                  times = 1)
# Split
trainSet <- data[ trainIndex,]
testSet  <- data[-trainIndex,]

# Check balance of classes in the test and training sets
table(trainSet$class)
## 
## Agriculture  Industrial 
##          90          90
table(testSet$class)
## 
## Agriculture  Industrial 
##          60          60
# Clean-up
rm(data,trainIndex)

Scale and center

Again, using caret convenience functions to center and scale predictors.

# Autoscaling using Caret (without the class column)
preProcValues <- caret::preProcess(trainSet[,-1], method = c("center", "scale"))

# Apply the scaling to both the training set and test set (-class)
# using the same scaling parameters (i.e. mean and sd) for both
trainTransformed <- predict(preProcValues, trainSet[,-1])
testTransformed <- predict(preProcValues, testSet[,-1])

# Return class column to the transformed data
trainTransformed <- cbind(class=trainSet$class, trainTransformed)
testTransformed <- cbind(class=testSet$class, testTransformed)

# Clean-up
rm(preProcValues, trainSet, testSet)

Train

The code chunk below use caret interface to RF, which allows easy syntax to set training parameters. It is expected that you already practiced caret during this course before. Also, you may see references to caret and RF at the end of this practical handout.

For the sake of time, we use only a small dataset for this practical session. So the training will be quick (also, RF is a relatively fast ML method). However, in real life training of ML models may take long time. So parallelization becomes important. Here we use parallelMap R package that is well integrated with caret for training ML models.

# Set Parameter Tuning
control <- trainControl(method='repeatedcv', number=10, repeats=10, search='grid')

# Alternate Tuning Grids
grid <- expand.grid(mtry= seq(10, 200, 10))

# Start parallelization
parallelStartSocket(cpus = parallel::detectCores())
  
# train model (this step takes a bit of time)
# class has to be set as factor
model_rf <- train(as.factor(class)~., data=trainTransformed,
                  method="rf", metric = 'Accuracy',
                  ntrees = 800, tuneGrid = grid,
                  trControl = control )

# Stop parallelization
parallelStop()

# Check results
model_rf
## Random Forest 
## 
## 180 samples
##  76 predictor
##   2 classes: 'Agriculture', 'Industrial' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 10 times) 
## Summary of sample sizes: 162, 162, 162, 162, 162, 162, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    10   0.9566667  0.9133333
##    20   0.9461111  0.8922222
##    30   0.9433333  0.8866667
##    40   0.9333333  0.8666667
##    50   0.9311111  0.8622222
##    60   0.9244444  0.8488889
##    70   0.9233333  0.8466667
##    80   0.9238889  0.8477778
##    90   0.9238889  0.8477778
##   100   0.9222222  0.8444444
##   110   0.9233333  0.8466667
##   120   0.9233333  0.8466667
##   130   0.9227778  0.8455556
##   140   0.9227778  0.8455556
##   150   0.9227778  0.8455556
##   160   0.9227778  0.8455556
##   170   0.9227778  0.8455556
##   180   0.9211111  0.8422222
##   190   0.9205556  0.8411111
##   200   0.9244444  0.8488889
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 10.
# Clean-up
rm(control, grid, trainTransformed)

Evaluate the model

Now we have a forest of decision trees, that are supposed to predict Prior Land Use from the abundance of microbial classes detected in the soil samples.

How can we evaluate that these RF predictions are true?

Good that we have the test set that we initially reserved in our data: it was not used for training, and we know the true labels in this test set. So, we may use the microbial abundancies to predict the classes in the test set, and then compare the predicted classes with the true ones. Such comparison is called evaluting confusion matrix”.

# Predict classes from microbial taxes in the test set
predict.rf <- predict(model_rf, testTransformed)

# Calculate confusion matrix (note that class should be set as factor)
confusion.matrix <- confusionMatrix(predict.rf, 
                                    as.factor(testTransformed$class),
                                    positive="Industrial")
# Show the confusion matrix
confusion.matrix
## Confusion Matrix and Statistics
## 
##              Reference
## Prediction    Agriculture Industrial
##   Agriculture          57          2
##   Industrial            3         58
##                                           
##                Accuracy : 0.9583          
##                  95% CI : (0.9054, 0.9863)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9167          
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.9667          
##             Specificity : 0.9500          
##          Pos Pred Value : 0.9508          
##          Neg Pred Value : 0.9661          
##              Prevalence : 0.5000          
##          Detection Rate : 0.4833          
##    Detection Prevalence : 0.5083          
##       Balanced Accuracy : 0.9583          
##                                           
##        'Positive' Class : Industrial      
## 
# Clean-up
rm(predict.rf, testTransformed)

Has our RF model performed well?

Most informative taxa

One of the advantages of the RF method is that it allows to estimate how different predictors contributed into the final classification. Thus, each Random Tree in the Forest puts all predictors in the order of their importance within this tree. Then, the average rank of a predictor over all the Trees gives its overall rank for the entire Forest.

The code chunk below illustrates how this information can be extracted from the RF model that we have built.

# Get taxa importance from the model
varImp.res <- varImp(model_rf)
varImp.df <- varImp.res$importance

# Look at the top  informative taxa
varImp.df %>% 
  arrange(desc(Overall)) %>% 
  head(10)
##                          Overall
## Bacilli               100.000000
## Gammaproteobacteria    52.067749
## Verrucomicrobiae       34.110343
## Polyangia_463783       27.421908
## Limnocylindria         22.528468
## Clostridia_258483      20.283050
## Ktedonobacteria        17.939605
## Blastocatellia         17.136488
## Actinomycetia          14.267658
## Acidimicrobiia_401430   9.974571
# Order the taxa by importance (for plotting)
varImp.df <- cbind(taxa=rownames(varImp.df), varImp.df)
varImp.df <- varImp.df[ order(varImp.df$Overall), ]

# Make the plot
ggbarplot(varImp.df, "taxa", "Overall", 
          orientation = "horiz",fill = "#619CFF") +
  theme_bw()

# Clean-up
rm(varImp.res)

How many of the top 10 taxa most informative for RF model are differentially expressed by ANCOMBC2?

Congratulation!

You have completed this practical session!

Session Info

ls()
## [1] "BacIndAgriClassAbundNorm.phy" "confusion.matrix"             "dif_results.df"               "model_rf"                     "varImp.df"
sessionInfo()
## R version 4.3.1 (2023-06-16 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19045)
## 
## Matrix products: default
## 
## 
## locale:
## [1] LC_COLLATE=English_United Kingdom.utf8  LC_CTYPE=English_United Kingdom.utf8    LC_MONETARY=English_United Kingdom.utf8 LC_NUMERIC=C                            LC_TIME=English_United Kingdom.utf8    
## 
## time zone: Europe/London
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] parallelMap_1.5.1    randomForest_4.7-1.1 caret_6.0-94         mixOmics_6.24.0      MASS_7.3-60          ggpubr_0.6.0         ggplotify_0.1.2      vegan_2.6-4          lattice_0.21-8       permute_0.9-7        dendextend_1.17.1    Polychrome_1.5.1     phyloseq_1.44.0      ggplot2_3.5.0        dplyr_1.1.2         
## 
## loaded via a namespace (and not attached):
##   [1] fs_1.6.3                       matrixStats_1.2.0              bitops_1.0-7                   DirichletMultinomial_1.42.0    lubridate_1.9.3                RColorBrewer_1.1-3             httr_1.4.7                     doParallel_1.0.17              numDeriv_2016.8-1.1            tools_4.3.1                    doRNG_1.8.6                    backports_1.4.1                utf8_1.2.3                     R6_2.5.1                       DT_0.32                        lazyeval_0.2.2                 mgcv_1.9-0                     rhdf5filters_1.12.1            withr_3.0.0                    gridExtra_2.3                  cli_3.6.1                      Biobase_2.60.0                 sandwich_3.1-0                 labeling_0.4.3                 sass_0.4.8                     mvtnorm_1.2-4                  proxy_0.4-27                   yulab.utils_0.1.4              foreign_0.8-84                 scater_1.28.0                  parallelly_1.37.1              decontam_1.20.0               
##  [33] readxl_1.4.3                   rstudioapi_0.15.0              RSQLite_2.3.5                  generics_0.1.3                 gridGraphics_0.5-1             gtools_3.9.5                   car_3.1-2                      Matrix_1.6-0                   biomformat_1.28.0              ggbeeswarm_0.7.2               fansi_1.0.4                    DescTools_0.99.54              S4Vectors_0.38.1               DECIPHER_2.28.0                abind_1.4-5                    lifecycle_1.0.4                scatterplot3d_0.3-44           multcomp_1.4-25                yaml_2.3.7                     carData_3.0-5                  SummarizedExperiment_1.30.2    recipes_1.0.10                 rhdf5_2.44.0                   grid_4.3.1                     blob_1.2.4                     crayon_1.5.2                   beachmat_2.16.0                pillar_1.9.0                   knitr_1.45                     GenomicRanges_1.52.1           boot_1.3-28.1                  gld_2.6.6                     
##  [65] corpcor_1.6.10                 future.apply_1.11.1            codetools_0.2-19               glue_1.6.2                     data.table_1.14.8              MultiAssayExperiment_1.26.0    vctrs_0.6.3                    png_0.1-8                      treeio_1.24.3                  Rdpack_2.6                     cellranger_1.1.0               gtable_0.3.4                   cachem_1.0.8                   gower_1.0.1                    xfun_0.39                      prodlim_2023.08.28             rbibutils_2.2.16               S4Arrays_1.0.6                 survival_3.5-5                 timeDate_4032.109              SingleCellExperiment_1.22.0    iterators_1.0.14               hardhat_1.3.1                  lava_1.7.3                     gmp_0.7-4                      TH.data_1.1-2                  ipred_0.9-14                   nlme_3.1-162                   bit64_4.0.5                    GenomeInfoDb_1.36.4            bslib_0.6.1                    irlba_2.3.5.1                 
##  [97] vipor_0.4.7                    rpart_4.1.19                   colorspace_2.1-0               BiocGenerics_0.46.0            DBI_1.2.2                      Hmisc_5.1-1                    nnet_7.3-19                    ade4_1.7-22                    NADA_1.6-1.1                   Exact_3.2                      tidyselect_1.2.0               bit_4.0.5                      compiler_4.3.1                 htmlTable_2.4.2                BiocNeighbors_1.18.0           expm_0.999-9                   DelayedArray_0.26.7            checkmate_2.3.1                scales_1.3.0                   stringr_1.5.1                  digest_0.6.33                  minqa_1.2.6                    rmarkdown_2.25                 XVector_0.40.0                 htmltools_0.5.7                pkgconfig_2.0.3                base64enc_0.1-3                lme4_1.1-35.1                  sparseMatrixStats_1.12.2       MatrixGenerics_1.12.3          highr_0.10                     fastmap_1.1.1                 
## [129] rlang_1.1.1                    htmlwidgets_1.6.4              BBmisc_1.13                    zCompositions_1.5.0-1          DelayedMatrixStats_1.22.6      farver_2.1.1                   jquerylib_0.1.4                zoo_1.8-12                     jsonlite_1.8.7                 energy_1.7-11                  BiocParallel_1.34.2            ModelMetrics_1.2.2.2           BiocSingular_1.16.0            RCurl_1.98-1.12                magrittr_2.0.3                 Formula_1.2-5                  scuttle_1.10.3                 GenomeInfoDbData_1.2.10        Rhdf5lib_1.22.0                munsell_0.5.0                  Rcpp_1.0.11                    ape_5.7-1                      viridis_0.6.5                  CVXR_1.0-12                    pROC_1.18.5                    stringi_1.7.12                 rootSolve_1.8.2.4              zlibbioc_1.46.0                plyr_1.8.8                     listenv_0.9.1                  parallel_4.3.1                 ggrepel_0.9.5                 
## [161] lmom_3.0                       Biostrings_2.68.1              splines_4.3.1                  multtest_2.56.0                igraph_1.5.0                   ggsignif_0.6.4                 rngtools_1.5.2                 reshape2_1.4.4                 stats4_4.3.1                   ScaledMatrix_1.8.1             evaluate_0.23                  nloptr_2.0.3                   foreach_1.5.2                  tidyr_1.3.0                    purrr_1.0.1                    future_1.33.1                  rsvd_1.0.5                     broom_1.0.5                    Rmpfr_0.9-5                    RSpectra_0.16-1                e1071_1.7-14                   tidytree_0.4.6                 rstatix_0.7.2                  viridisLite_0.4.2              class_7.3-22                   gsl_2.1-8                      rARPACK_0.11-0                 truncnorm_1.0-9                tibble_3.2.1                   lmerTest_3.1-3                 ellipse_0.5.0                  memoise_2.0.1                 
## [193] beeswarm_0.4.0                 IRanges_2.34.1                 cluster_2.1.4                  TreeSummarizedExperiment_2.8.0 timechange_0.3.0               globals_0.16.2                 mia_1.8.0
date()
## [1] "Tue Mar 19 20:19:27 2024"